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Abstract — A  strong  physically  unclonable  function  (PUF) 
is  a  circuit  structure  that  extracts  an  exponential  number  of 
unique  chip  signatures  from  a  bounded  number  of  circuit 
components.  The  strong  PUF  unique  signatures  can  enable  a 
variety  of  low-overhead  security  and  intellectual  property  pro¬ 
tection  protocols  applicable  to  several  computing  platforms. 
This  paper  proposes  a  novel  lightweight  (low  overhead)  strong 
PUF  based  on  the  timings  of  a  classic  processor  architecture. 
A  small  amount  of  circuitry  is  added  to  the  processor  for  on- 
the-fly  extraction  of  the  unique  timing  signatures.  To  achieve 
desirable  strong  PUF  properties,  we  develop  an  algorithm 
which  leverages  intentional  post-silicon  aging  to  tune  the  inter¬ 
chip  and  intra-chip  signature  variation.  Our  evaluation  results 
show  that  the  new  PUF  meets  the  desirable  inter-  and  intra¬ 
chip  strong  PUF  characteristics,  while  its  overhead  is  much 
lower  than  the  existing  strong  PUFs.  For  the  processors  imple¬ 
mented  in  45nm  technology,  the  average  inter-chip  Hamming 
distance  for  32-bit  responses  is  increased  by  16.1%  after 
applying  our  post-silicon  tuning  method;  the  aging  algorithm 
also  decreases  the  average  intra-chip  Hamming  distance  by 
98.1%  (for  32-bit  responses). 

Index  Terms — Physically  unclonable  function,  multi-core 
processor,  secure  computing  platform,  post-silicon  tuning, 
circuit  aging,  negative  bias  temperature  instability 

I.  Introduction 

Achieving  secure  and  trustworthy  computing  and 
communication  is  a  grand  challenge.  Several  known 
data/program  security  and  trust  methods  leverage  a  root 
of  trust  in  the  processing  units  to  achieve  their  goals. 
Microprocessors  and  other  heterogeneous  processing  cores 
-  which  form  the  kernels  of  most  modern  computing 
and  communication  -  have  become  increasingly  mobile, 
limiting  the  amount  of  available  energy  and  resources. 
Traditional  security  and  trust  methods  based  on  classic 
cryptography  are  often  computationally  intensive  and  thus 
undesirable  for  low  power  portable  platforms.  Mobility  and 
low  power  also  favor  smaller  and  simpler  form  factors  that 
are  unfortunately  known  to  be  more  susceptible  to  attacks 
such  as  side-channels  or  invasive  exploits.  There  is  a  search 
for  low  overhead  and  attack-resilient  security  methods  that 
operate  on  low  power  computing  platforms. 

Physically  unclonable  function  (PUF)  is  a  promising 
circuit  structure  to  address  the  pending  security  needs 
of  several  portable  and  resource-constrained  computing 
platforms.  Thanks  to  the  unique  and  unclonable  process 
variations  (PVs)  on  each  chip,  PUFs  can  generate  specific 
signatures  for  each  manufactured  IC.  Technically,  PVs 
mainly  affect  threshold  voltage  ( Vth )  or  effective  gate 
length  ( Leff )  of  the  devices  in  a  chip  [  1  ]  [2],  These  unique 
device  characteristics  can  be  measured  by  the  structural 
side-channel  tests  such  as  timing  or  current  of  specific 
test  vectors.  To  ease  integration  into  higher-level  digital 
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security  primitives,  it  is  desirable  to  transform  the  measured 
structural  test  results  to  digital  values.  The  unclonability 
and  inherent  uniqueness  properties  of  signatures  makes 
PUF  an  attractive  security  primitive  choice  [3], 

PUF  signatures  are  typically  extracted  by  a  challenge- 
response  protocol.  In  response  to  a  challenge  (or  input), 
the  PUF  generates  a  unique  response  (or  output )  that  is 
dependent  on  the  specific  PV  of  the  underlying  chip.  PUFs 
have  been  classified  into  two  broad  categories;  Weak  and 
Strong.  Weak  PUFs  have  a  limited  number  of  challenge- 
response  pairs  (CRPs),  which  restricts  their  application 
scenarios  to  those  requiring  a  few  secret  bits  such  as  key 
generation.  Strong  PUFs  generate  an  exponential  number  of 
CRPs  from  a  limited  number  of  circuit  components.  Strong 
PUFs  enable  a  wider  range  of  security  and  trust  protocols 
by  leveraging  their  huge  space  of  CRPs. 

Although  the  already  proposed  strong  PUFs  have  shown 
promising  results  [4],  their  application  is  still  limited  due  to 
their  non-negligible  overhead  and  instability.  For  example, 
AEGIS  secure  processor  design  [5]  which  realizes  a  trust¬ 
worthy  hardware  platform,  has  a  non-negligible  hardware 
overhead  of  the  added  logic  including  the  arbiter  PUF  for 
supporting  secure  execution.  Apart  from  the  PUF  logic 
itself,  a  large  portion  of  hardware  overhead  often  comes 
from  error  correction  logic.  Since  PUFs  should  be  able 
to  produce  stable  outputs  under  various  environmental 
conditions  (e.g.,  voltage  and  temperature  fluctuations),  error 
correction  logic  overhead  is  inevitable,  yet  desired  to  be 
reduced.  Moreover,  natural  PUFs  may  have  undesirable 
statistical  distributions  in  terms  of  inter-chip  variations, 
which  significantly  restricts  their  practical  applicability. 
The  statistical  distribution  becomes  even  worse  when  spa¬ 
tial  correlations  between  the  device  characteristics  due  to 
process  variation  (in  particular,  systematic  variations)  are 
prevalent  across  the  chips. 

In  this  paper,  we  introduce  an  alternative  strong  PUF 
architecture,  based  on  a  conventional  multi-core  processor. 
Our  PUF  design  is  a  realization  of  a  low-overhead  and  sta¬ 
ble  strong  PUF.  By  leveraging  the  built-in  structures  (adders 
in  AFUs)  in  typical  multi-core  microprocessors  instead  of 
building  additional  delay  logic  (e.g.,  a  series  of  switches 
and  a  series  of  inverter  chains  in  arbiter  PUFs  and  ring 
oscillator  (RO)  PUFs  [6],  respectively),  our  design  realizes 
a  low-overhead  and  secure  strong  PUF  which  can  be  em¬ 
ployed  to  many  security  applications.  A  proof-of-concept 
implementation  is  demonstrated  on  a  two-core  architecture. 
To  further  improve  security,  reliability,  and  stability  of  the 
PUFs  as  well  as  make  up  for  possible  drawbacks  of  the 
two-core  PUF  design,  we  also  propose  a  systematic  post¬ 
silicon  tuning  method  for  our  PUF.  Our  new  algorithm 
leverages  an  intentional  aging  method  based  on  one  of 
the  most  significant  circuit  aging  mechanisms:  negative 
bias  temperature  instability  (NBTI)  [7].  Our  proposed  post¬ 
silicon  aging  algorithm  does  not  incur  any  performance 
overhead  in  most  of  the  chips  by  careful  consideration  of 
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selecting  the  gates  that  will  be  intentionally  aged.  Also,  our 
algorithm  greatly  improves  statistical  properties  of  our  PUF 
design  in  terms  of  both  inter-chip  and  intra-chip  variations. 

Our  main  contributions  include: 

•  We  propose  a  low  overhead  strong  PUF  design,  two- 
core  PUF,  which  leverages  built-in  components  in 
general  processor  architectures; 

•  Our  new  PUF  design  shows  good  statistical  results, 
comparable  to  the  previously  proposed  strong  PUF 
designs.  The  hardware  overhead  of  the  new  PUF  is 
lower  than  the  previously  proposed  ones; 

•  We  propose  a  systematic  method  to  further  enhance 
statistical  properties  of  our  multi-core  PUF  in  terms  of 
both  inter-chip  and  intra-chip  variations  by  leveraging 
intentional  aging,  which  complements  the  possible 
drawbacks  of  our  PUF  design; 

•  Our  simulation  results  on  a  two-core  architecture  prove 
that  our  intentional  aging  algorithms  successfully  im¬ 
prove  the  statistical  property  of  the  two-core  PUF  with 
negligible  performance  overhead  in  most  cases. 

The  rest  of  this  paper  is  organized  as  follows.  Section 

II  outlines  background  information  for  process  variation, 
delay  model,  and  circuit  aging  mechanism/model.  Section 

III  explains  our  two-core  PUF  design  while  Section  IV 
introduces  our  systematic  tuning  method  by  leveraging  in¬ 
tentional  aging  to  tune  the  statistical  properties  of  the  intro¬ 
duced  PUF.  Evaluation  results  for  the  two-core  realization 
and  intentional  aging  algorithms  are  discussed  in  Section  V. 
Section  VI  provides  a  brief  review  of  the  recent  literatures 
regarding  PUFs  and  intentional  post-silicon  aging  methods. 
Lastly,  we  conclude  in  Section  VII. 

II.  Background  and  Preliminaries 

In  this  section,  we  provide  general  background  infor¬ 
mation  and  preliminaries  for  process  variation,  delay,  and 
aging  mechanism.  The  background  and  preliminaries  are  to 
make  the  paper  self-contained  and  accessible  to  a  broader 
audience  who  may  not  be  familiar  with  process  variation, 
delay  model,  and  aging. 
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Fig.  1.  Four  process  variation  map  examples  generated  by  quad-tree 
process  variation  model  [1],  The  number  in  the  right  side  of  the  figures 
means  Z  value  of  Gaussian  distribution. 


B.  Delay  model 

To  figure  out  the  Vi/j- dependent  gate  delay,  we  use  the 
delay  model  described  in  [8].  The  gate-level  delay  model 
can  be  represented  as  follows: 


Delay  oc  (— j —  )2  x 
4>t 


Vdd 


( ln{e 


U  +  ^)Vdd-Vth 


+  1))2 


(1) 


where  (j>t  and  a  are  thermal  voltage  and  subthreshold  slope, 
respectively.  There  are  several  other  key  factors  that  affect 
gate-level  delay:  supply  voltage  (Vdd),  threshold  voltage 
(Vth),  and  effective  gate  length  (Leff).  Due  to  process 
variations,  these  factors  fluctuate,  which  in  turn  results  in 
delay  differences  across  the  gates  in  chips.  Furthermore, 
circuit  aging  (it  will  be  covered  in  detail  in  Section  II-C) 
also  affects  gate  delay  since  circuit  aging  increases  Vth  of 
the  gate. 


A.  Process  variation 

Process  variation  (PV)  generates  inherent  randomness  in 
silicon  structures.  PV  mainly  affects  threshold  voltage  (Vth) 
and  effective  gate  length  (Te//)  of  devices,  resulting  in 
various  side-effects  (e.g.,  delay  and  power  consumption) 
across  chip  instances. 

PV  can  be  classified  into  two  broad  categories:  random 
and  systematic  variation.  Random  variation  is  caused  by 
random  dopant  fluctuations  or  random  defects  in  devices. 
Random  variation  does  not  have  any  spatial  correlation 
between  the  devices.  Unlike  random  variation,  systematic 
variation  incurs  spatially  correlated  device  fluctuations.  It 
means  that  the  devices  which  are  close  together  have  a 
higher  probability  to  have  similar  device  characteristics 
than  those  located  far  away.  In  contemporary  process  tech¬ 
nologies,  both  random  and  systematic  variation  coexist  in 
manufactured  chips. 

Figure  1  shows  sample  Vth  distribution  maps  generated 
by  a  quad-tree  PV  model  [1].  Vth  distribution  is  shown  to 
be  fairly  random  in  a  single  chip  as  well  as  across  the  chips, 
while  similar  colors  tend  to  agglomerate  together  (i.e.,  Vth 
distributions  are  spatially  correlated). 


C.  Aging  model 

Circuit  aging  is  a  phenomenon  in  which  performance 
of  the  circuits  is  degraded  by  the  circuit  usage.  This  may 
eventually  result  in  a  malfunction  of  the  circuit  under 
intensive  utilizations  or  extreme  environmental  conditions 
(e.g.,  extremely  high  temperature).  Compared  to  fresh  chips 
(i.e.,  not  aged),  aged  chips  have  relatively  lower  perfor¬ 
mance  due  to  Vth  shift  by  hot  carrier  injection  (HCI)  and 
negative  bias  temperature  instability  (NBTI).  Vth  of  devices 
is  continuously  increased  as  those  devices  are  switched  or 
have  a  high  duty  cycle,  resulting  in  higher  delay  and  lower 
power  consumption. 

In  deep  submicron  process  technologies,  NBTI  is  known 
to  be  the  most  threatening  aging  mechanism  [7].  Thus,  in 
this  paper,  we  consider  NBTI  as  our  main  aging  mecha¬ 
nism.  The  Vth  shift  (A  Vth)  by  NBTI  is  commonly  modeled 
as  follows: 

A  Vth  =  Ax  e(Wfl)  x  x  f025  (2) 

where  Vg  and  Ea  are  gate  voltage  and  activation  energy 
respectively.  A  and  B  are  technology  dependent  constants. 
As  shown  in  Equation  2,  the  Vth  shift  heavily  depends  on 
temperature  (T)  and  stress  time  (t).  By  applying  this  aging 
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model,  one  can  derive  an  appropriate  stress  time  ( t )  under 
a  certain  temperature  ( T )  to  intentionally  increase  a  certain 
amount  of  Vth- 

Stress  time  t  is  strongly  dependent  on  the  signal  prob¬ 
ability  (SP)  [9]  that  represents  a  fraction  of  time  when  a 
gate  output  stays  logic  high  (1)  during  the  circuit  operation. 
Depending  on  SP  of  a  gate,  Vth  of  the  gate  will  be  increased 
(stress  period)  or  decreased  (recovery  period).  Hence,  to 
make  the  gate  intentionally  aged,  one  should  carefully 
determine  SP  of  the  gate  so  that  it  stays  in  the  stress  period 
much  more  than  in  the  recovery  period. 
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III.  Two-core  PUF 
A.  Design  philosophy  and  design  decisions 

1 )  Base  platform  -  multi-core  microprocessor:  Since  our 
design  is  fundamentally  based  on  the  delay  comparison 
mechanism  of  arbiter  PUFs,  we  need  symmetric  (homoge¬ 
neous)  structures  to  generate  diverse  path  delays  affected 
by  process  variations.  The  symmetric  multi-core  micropro¬ 
cessor  is  one  of  the  best  design  candidates  since  most  com¬ 
modity  microprocessors  (or  microcontrollers)  have  multiple 
homogeneous  cores. 

Typical  strong  PUF  designs  have  separate  delay  circuits 
to  generate  PUF  responses,  which  incur  additional  area  and 
power  overhead.  In  contrast,  our  PUF  design  utilizes  built- 
in  components  in  typical  multi-core  microprocessors,  which 
minimizes  additional  hardware  and  communication  over¬ 
head.  Compared  to  the  AEGIS  design  [5]  which  employs 
separate  switches  to  implement  an  arbiter  PUF,  our  design 
is  implementable  with  a  much  smaller  logic  overhead. 

2)  Path  delay  source  -  ALUs:  Our  design  chooses  ALUs 
as  path  delay  sources.  The  main  reason  is  that  ALUs  can 
accept  an  exponential  number  of  operands,  which  can  also 
be  used  as  challenge  inputs.  Moreover,  they  can  generate 
challenge-dependent  responses  when  using  add  instructions 
by  stimulating  the  complex  carry-chains  in  adder  structures. 
Add  instructions  can  have  an  exponential  number  of  differ¬ 
ent  operands  (264  with  32-bit  operands)  and  our  PUL  can 
also  generate  an  exponential  number  of  diverse  responses 
depending  on  the  challenge  inputs  as  well  as  disorders  in 
silicon  structures.  It  means  our  ALU-based  PUL  design  can 
be  classified  as  a  strong  PUF. 

The  other  reason  for  choosing  ALUs  as  path  delay 
sources  is  that  ALUs  are  combinational  logics  in  micro¬ 
processors  and  they  have  delay  paths  which  are  comprised 
of  a  long  series  of  gates.  It  makes  adversaries  difficult 
to  perform  a  model  building  attack.  This  is  because  the 
adversaries  should  perform  multiple  stages  of  gate-level 
delay  table  lookups  and  additions  to  obtain  the  accurate 
path  delays  through  their  PUF  model.  Determination  of 
carry  propagation  behaviors  also  introduce  a  lot  of  control 
dependencies,  which  means  it  is  difficult  for  adversaries 
to  exploit  the  massively  parallel  computations  in  order  to 
acquire  a  PUF  response  time  comparable  to  that  from  the 
real  PUF  hardware.  In  this  case,  one  can  give  a  timing 
constraint  (time-bound)  during  the  PUF  challenge  in  order 
to  distinguish  the  real  PUF  and  the  modeled  PUF.  Time- 
bounded  authentication  by  PUF  has  been  introduced  earlier 
[10]. 

Our  PUF  design  can  be  applied  to  any  adder  structures, 
though  in  this  paper  we  build  our  PUF  based  on  ripple- 
carry  adders  (RCAs)  as  a  proof-of-concept.  In  fact,  PUFs 
are  broadly  used  in  small  embedded  systems  (e.g.,  sen¬ 
sor  nodes  or  RFIDs)  [11][12]  or  FPGAs  [13][14][15]  in 
which  RCAs  are  more  beneficial  for  energy-efficiency  than 


Fig.  2.  The  basic  structure  of  our  two-core  PUF  (bit  width=4-bit). 


high-performance  adders  such  as  carry-lookahead  adders 
(CLAs).  Note  that  the  first  design  consideration  of  those 
embedded  systems  is  typically  energy-efficiency,  not  per¬ 
formance. 

B.  Overall  design 

1 )  PUF  design:  Delay-based  PUFs  [6]  exploit  delay 
differences  between  multiple  paths  which  have  inherently 
different  delays  across  chips  due  to  process  variations. 
One  may  deploy  arbiters  (or  counters/comparators  in  case 
of  ring-oscillator  PUFs)  to  capture  the  delay  difference 
between  two  delay  lines  and  convert  it  into  a  digitized 
value.  In  this  paper,  we  propose  an  alternative  strong  PUF 
design  which  utilizes  already  built-in  components  in  a 
processor  architecture  as  our  delay  lines  instead  of  building 
separate  delay  lines  (e.g.,  a  series  of  the  switches  in  arbiter 
PUFs  or  a  series  of  the  inverters  in  ring  oscillator  PUFs). 

Although  our  new  strong  PUF  can  be  built  based  on 
any  multi-core  processor  architecture,  in  the  remainder 
of  the  paper  we  focus  on  a  two-core  proof-of-concept 
design.  Generalization  to  more  cores  is  straightforward. 
Figure  2  shows  a  high-level  design  of  our  two-core  PUF. 
For  simplicity,  we  provide  a  simple  4-bit  two-core  PUF 
design  in  this  figure.  Our  PUF  utilizes  arithmetic  logic 
units  (ALUs)  in  the  multi-core  microprocessors/controllers 
as  symmetric  delay  lines.  In  order  to  give  a  challenge  input 
to  the  PUF,  the  identical  challenge  program  runs  in  both 
cores.  As  shown  in  Figure  2,  two  4-bit  operands  (operand  A 
and  B )  are  fed  into  each  ALU  and  a  4-bit  output  (S\  ~  Sf) 
can  be  obtained  from  each  ALU.  For  delay  comparison, 
the  7i- th  output  lines  (Sn)  from  each  ALU  are  connected  to 
the  n-th  arbiter  ( Arbitern ).  The  challenge  program  should 
start  at  the  same  cycle  in  both  cores  to  guarantee  correct 
PUF  operations.  Note  that  the  arbiters  in  the  circuit  layout 
should  be  very  carefully  placed  for  correct  operations  of 
the  two-core  PUF.  In  addition,  the  wire  lengths  from  two 
ALUs  to  the  arbiter  should  be  symmetric  not  to  generate 
biased  PUF  outputs. 

In  our  proof-of-concept  example,  bitwidth  of  our  base 
microprocessor  is  32-bit.  Hence,  each  core  has  a  32-bit 
ALU.  Sn  from  CoreO  and  Corel  are  connected  to  the 
Arbitern,  where  n  is  1-32.  Thus,  we  need  32  arbiters  for 
delay  comparison.  Note  that  our  design  can  be  easily  ex¬ 
tended  to  64-bit  microprocessors  by  simply  adding  32  more 
arbiters  and  connecting  the  corresponding  ALU  output  ports 
to  those  arbiters. 

2)  Security  enhancement  by  XOR  obfuscation:  Typical 
security  applications  desire  a  high  inter-response  variations 
(i.e.,  high  unpredictability).  A  low  inter-response  variation 
may  make  the  PUF  vulnerable  to  the  modeling  attack  [16] 
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Fig.  5.  A  more  detailed  structure  of  our  two-core  PUF.  For  simplicity. 
Fig.  3.  Additional  logic  for  XOR  obfuscation.  only  one  arbiter  and  one  temporary  register  (flip-flop)  are  shown  in  the 

figure.  The  XOR  obfuscation  logic  is  drawn  in  a  dashed-line  since  it  is 
an  optional  logic. 


because  only  a  small  set  of  CRPs  may  enable  an  accurate 
modeling  of  a  specific  PUF  by  adversaries.  For  better  inter¬ 
response  variations  of  our  PUF  design,  one  can  deploy 
an  additional  XOR  obfuscation  step  between  two  different 
response  bits  as  described  in  [17], 

By  paying  a  little  more  hardware  cost,  one  can  perform 
an  XOR  operation  between  *-th  bit  and  (i+bltw*d™)-th  bit 
from  a  response,  as  shown  in  Figure  3.  PUF  operations 
should  be  performed  twice  with  different  challenges  in 
order  to  generate  a  bitwidth-bit  response,  which  also  incurs 
timing  overhead.  Considering  the  trade-off  among  the  hard¬ 
ware  cost,  performance,  and  security,  one  can  employ  the 
additional  XOR  obfuscation  step  only  for  the  case  where  a 
high  level  of  security  is  required. 

As  shown  in  Figure  4,  the  inter-response  variation  is 
greatly  improved  by  adding  the  XOR  obfuscation  step. 
Comparing  between  the  case  with  and  without  XOR  ob¬ 
fuscation,  an  average  inter-response  Hamming  distance  is 
increased  from  5.06  bits  to  10.64  bits  and  from  11.81  bits 
to  20.53  bits  when  using  32-bit  and  64-bit  two-core  PUF, 
respectively. 

C.  Detailed  design  and  architectural  modifications 

Delay  characteristics  in  our  PUF  depend  on  the  carry 
propagation  behavior  in  the  conventional  ripple-carry  adder 
(which  is  included  in  ALUs).  As  shown  in  Figure  5,  two 
operands  (A,  and  Bi )  are  fed  into  the  full  adders.  Between 
the  full  adders,  there  are  carry  bits  ( Ci ),  which  depend  on 
the  operands  ( Ai  and  Bi)  and  previous  carry  bit  (C',;_  i ). 
Depending  on  the  carry  bit,  delay  characteristics  of  the 
full  adder  rely  on  those  of  either  the  preceding  full  adders 
or  only  the  current  full  adder.  These  carry  propagation 
behaviors  generate  an  exponential  number  of  the  signal 
propagation  behaviors  in  the  adder,  which  eventually  en¬ 
ables  a  generation  of  challenge-dependent  PUF  outputs. 
The  summation  result  bits  (Si)  from  the  ALU  (in  each 
core)  are  connected  to  the  arbiters.  S,  is  also  connected 
to  the  ALU  output  storage  which  is  already  implemented 
in  general  processor  architectures,  though  it  is  not  shown  in 
Figure  5.  The  signals  from  two  separate  ALUs  race  to  the 
arbiter,  which  in  turn  generates  a  digitized  output  depending 
on  which  delay  line  is  faster.  The  arbiter  output  is  stored 
to  a  temporary  register  (‘PUF  Response,’  in  Figure  5). 

As  we  explained  in  Section  III-B2,  the  response  bits  may 
be  XOR-ed  together  (Lth  bit  ®  (i+  bitwjdth  )-th  bit)  and  the 
XOR-ed  results  are  finally  stored  into  one  half  of  the  final 


An  example  assembly  code  for  one-time  PUF  query 

1 

addi 

$1,  $0,  A  (or  from  register) 

#  load  operand  A  to  register  rl 

2 

addi 

$2,  $0,  B  (or  from  register) 

#  load  operand  B  to  register  r2 

3 

add 

$5,  $0,  $0 

#  initialization  for  delay  measurement 

4 

add 

$3,  $1,  $2 

#  the  first  add  operation  -  $r3=$rl+$r2 

5 

addi 

$5,  so,  OxffMff 

#  initialization  for  delay  measurement 

6 

add 

$3,  $1,  $2 

#  the  second  add  operation  -  $r3=$rl+$r2 

Fig.  6.  An  example  challenge  program  (instruction  sequence)  for  one¬ 
time  PUF  query  (bitwidth=32-bit). 


output  register  ($rp:  a  special  purpose  register  to  store  the 
output  from  the  two-core  PUF).  The  other  half  of  the  output 
register  is  filled  by  performing  the  PUF  operation  once 
again  with  different  challenge  inputs.  After  the  results  are 
stored  to  the  PUF  output  register,  the  challenge  program 
can  access  this  register  for  later  usages. 


D.  Challenge  procedure 

In  order  to  give  challenge  inputs  to  our  two-core  PUF, 
we  utilize  a  software-level  challenge  program.  Figure  6 
shows  an  example  program  for  a  PUF  query  based  on 
MIPS  assembly  codes.  One-time  PUF  query  is  performed 
as  follows.  Before  starting  the  PUF  operation,  the  operands 
(A  and  B)  are  loaded  into  the  registers  (Line  1-2  in 
Figure  6).  The  actual  PUF  operation  is  performed  by 
four  consecutive  addition  operations  (Line  3-6  in  Figure 
6).  Among  these  four  add  instructions,  the  instructions 
in  Line  3  and  5  in  Figure  6  are  used  to  initialize  the 
ALU  output  ports  to  ‘O’  and  ‘1’,  respectively.  In  addition, 
these  instructions  also  initialize  the  signals  in  the  carry 
propagation  chains  (from  C\  to  C32)  to  ‘O’.  The  add 
instmctions  in  Line  4  and  6  in  Figure  6  are  to  perform 
an  actual  PUF  operation  by  stimulating  the  internal  gates 
in  the  ALUs.  The  instructions  in  Line  3-4  and  Line  5-6 
are  dedicated  to  capture  0-4-1  and  1  — >  0  transitions 
in  the  arbiter,  respectively.  In  this  work,  we  use  dual¬ 
trigger  latches  (arbiters)  to  capture  both  up  (0  — >  1)  and 
down-transitions  (1  — >•  0).  Note  that  the  operating  system 
can  block  the  other  program  execution  during  the  PUF 
operation  to  prevent  the  unintended  resource  (ALU)  sharing 
which  may  incur  cycle-level  discrepancy  between  the  two 
cores. 
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Fig.  4.  Inter-response  Hamming  distance  variations  when  10,000  random  different  inputs  are  fed  into  the  two-core  PUF.  The  x-axis  and  y-axis 
corresponds  to  the  Hamming  distances  and  probability  mass  function. 


E.  Practicality  issues 

Since  our  design  utilizes  an  in-built  structure  (adder) 
instead  of  the  specialized  circuit  for  PUF,  some  imple¬ 
mentation  issues  may  arise.  In  this  subsection,  we  address 
several  practicality  issues  of  the  two-core  PUF  design. 

1 )  Intermediate  signal  fluctuations  in  the  output  port:  In 
the  general  circuit  structures,  there  could  be  some  ripples 
(fluctuations  of  the  signal  before  capturing  the  true  signal) 
in  the  output  port.  If  the  multiple  input  ports  are  connected 
to  one  output  port,  these  fluctuations  may  occur  because 
signal  propagation  delays  from  those  input  ports  connected 
to  the  output  port  are  likely  to  be  diverse.  Therefore,  if  the 
path  delay  sources  for  a  delay-based  PUF  are  generated 
from  the  general  circuit  structures,  it  could  be  problematic 
due  to  the  ambiguity  of  when  to  capture  the  transition  signal 
in  the  arbiters  (i.e.,  selecting  the  signal  to  capture). 

However,  in  the  case  of  a  ripple-carry  adder  that  con¬ 
stitutes  the  path  delay  sources  in  the  two-core  PUF,  the 
signal  in  the  output  ports  fluctuates  at  most  twice.  In 
most  cases,  the  first  and  second  output  signal  fluctuations 
result  from  the  operands  fluctuations  (i.e.,  when  Ai  and 
Bi  are  fed  into  the  full  adder)  and  carry  propagations  (i.e., 
a  signal  transition  in  U,;_i),  respectively.  Once  the  carry 
signal  (Ci- 1)  is  converted  from  0  to  1,  it  does  not  make 
a  transition  into  0  again  within  one  add  operation,  which 
restricts  the  maximum  number  of  possible  transitions  in  the 
output  port  of  the  adder  to  2. 

There  can  be  6  different  cases  of  signal  fluctuations  cap¬ 
tured  by  the  arbiter  in  our  PUF:  0— >-1— >-0,  1— >0— »•  1, 
0  — 1,  1  — »•  0,  0  — »•  0  (not  fluctuating  from  0),  and  1  — »  1 
(not  fluctuating  from  1).  Among  them,  only  the  cases  of 
0  — >  1  and  1  — ►  0  generate  valid  outputs  in  the  arbiters. 
In  the  other  cases,  the  values  generated  in  the  arbiter  are 
ignored.  The  following  subsection  describes  sorting  of  valid 
and  invalid  output  bits. 

2 )  Sorting  of  the  valid  and  invalid  output  bits:  In  order 
to  make  use  of  only  valid  output  bits,  one  may  need  addi¬ 
tional  MUXes  between  the  arbiter  and  temporary  register 
to  generate  desired  PUF  outputs. 

As  shown  in  Figure  7,  one  can  deploy  a  MUX  between 
the  arbiter  and  temporary  register.  By  referring  to  the 
control  signal,  the  MUX  selects  the  value  either  from  the 
arbiter  or  from  the  temporary  register.  Control  signals  can 
be  generated  by  referring  to  the  summation  result  bit  (Si). 
In  the  first  phase  of  the  PUF  query,  which  corresponds  to 
Line  3-4  in  Figure  6,  Si  is  directly  fed  into  the  control 
port  of  the  MUX.  If  Si  is  ‘1’,  the  MUX  selects  the  value 


Fig.  7.  Selection  of  the  valid  PUF  outputs  by  using  a  MUX. 


from  the  arbiter  whose  output  is  generated  by  capturing 
0  — ►  1  transitions.  Otherwise,  the  arbiter  output  is  ignored 
by  selecting  the  temporary  register  value  in  the  MUX.  In 
contrast,  the  negation  of  Si  is  fed  into  the  control  signal 
of  the  MUX  in  the  second  phase  (Line  5-6  in  Figure  6) 
to  identify  the  valid  arbiter  output  generated  by  capturing 
1  — >•  0  transitions. 

3)  Runtime  temperature  difference  between  two  cores: 
Since  our  two-core  PUF  design  is  based  on  the  structures 
in  different  processor  cores,  there  may  be  a  temperature 
difference  between  two  cores  which  may  incur  delay  dif¬ 
ferences  (i.e.,  delay  behavior  may  be  biased).  Since  thermal 
behaviors  of  the  two  cores  are  likely  to  be  diverse  depend¬ 
ing  on  characteristics  of  the  program  previously  executed 
before  the  PUF  operation,  it  may  make  our  PUF  responses 
different  from  the  expected  responses. 

To  deal  with  different  thermal  behaviors  of  two  ALUs, 
we  can  employ  thermal  sensors  to  detect  the  temperature 
difference  between  the  ALUs.  Typical  microprocessors 
already  have  the  thermal  sensors  in  their  expected  localized 
hotspots  [18],  which  means  one  does  not  need  to  deploy 
additional  thermal  sensors  only  for  the  two-core  PUF  To 
guarantee  the  PUF  operation  correctness,  operating  systems 
(OSes)  can  read  the  temperature  from  the  thermal  sensors 
before  the  PUF  operation  begins.  If  there  is  a  temperature 
difference  between  two  ALUs,  the  OS  cools  the  hotter  ALU 
down  by  enforcing  the  sleep  mode.  Though  it  may  incur 
performance  overhead  due  to  the  sleep  period  in  one  core, 
the  performance  loss  is  insignificant  in  the  authentication 
process  (i.e.,  it  is  not  performed  in  general  program  run¬ 
time,  but  only  in  authentication  program  runtime). 

For  a  design-level  solution,  one  can  utilize  two  ALUs 
from  one  core  in  the  case  of  superscalar  processors.  The 
ALUs  in  one  core  are  likely  to  have  similar  thermal 
behaviors  due  to  their  close  physical  distance.  Otherwise, 
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one  can  also  add  redundant  ripple-carry  adders  in  the 
microprocessor,  which  will  yield  a  little  more  hardware 
overhead,  though  our  PUF  has  only  a  small  implementation 
overhead  (a  detailed  analysis  on  the  hardware  overhead  will 
be  described  in  the  following  subsection). 

F.  Implementation  overhead 

Typical  microprocessors  or  microcontrollers  already  have 
several  cores  or  ALUs  to  support  multi-programmed/multi- 
threaded  workloads  or  higher  instruction-level  parallelism 
(ILP).  Our  PUF  design  realizes  a  strong  PUF  with  much 
lower  hardware  overhead  by  leveraging  built-in  compo¬ 
nents.  Assuming  one  builds  a  two-core  PUF  based  upon 
already  built-in  ALUs  in  a  32-bit  processor,  an  additional 
hardware  cost  is  only  arbiters,  MUXes,  and  temporary 
storage  for  32-bit  data.  If  one  needs  an  additional  XOR 
obfuscation  stage,  only  additional  XOR  gates  are  to  be 
added.  Even  if  one  builds  a  two-core  PUF  without  an 
underlying  processor  architecture,  our  PUF  design  only 
needs  96  2-to-l  MUXes  (or  288  NANDs),  128  XOR  gates, 
32  arbiters,  and  32  flip-flops  including  the  logic  shown  in 
Figure  7.  Compared  to  the  conventional  arbiter  PUF  (32- 
input/32-output)  which  needs  2048  2-to-l  MUXes  and  32 
arbiters,  our  PUF  design  incurs  far  less  hardware  cost.  As 
a  result,  our  PUF  design  yields  much  lower  area/power 
overhead  compared  to  the  conventional  strong  PUF  designs. 

IV.  Post-silicon  Tuning  of  Two-core  PUF  via 
Intentional  Aging 

A.  High-level  description  of  our  post-silicon  tuning 

1 )  Rationale:  Though  our  two-core  PUF  provides  fairly 
good  statistical  distributions  in  general  cases  (see  Section 
V-B),  the  manufactured  PUFs  may  not  show  sufficiently 
good  statistical  properties  in  practice.  In  this  case,  one  may 
have  to  discard  manufactured  PUF  chips  due  to  the  low 
quality  statistical  properties,  which  results  in  yield  losses 
of  the  chips. 

The  two  possible  problematic  conditions  for  our  manu¬ 
factured  PUFs  include: 

•  Low  inter-chip  variations:  This  case  may  often  hap¬ 
pen  (particularly  for  two-core  PUFs)  because  the  two 
ALUs  in  each  core  are  not  close  together,  which  may 
result  in  systematic  bias  between  two  ALUs  across 
the  chip  instances  [19].  It  can  also  be  incurred  by 
suboptimal  layouts  (i.e.,  asymmetric  placement  of  the 
arbiters  and  different  wire  lengths  between  each  ALU 
to  the  arbiters)  as  well  as  inherent  process  variation.  In 
this  case,  regardless  of  the  process  variation  in  chips, 
arbiter  outputs  would  be  biased  to  either  ‘0’  or  ‘U, 
which  in  turn  results  in  losing  the  uniqueness  of  the 
PUF  instances  (i.e.,  reduced  randomness  among  the 
PUF  outputs  from  different  chip  instances). 

•  High  intra-chip  variations:  Delay-based  PUFs  are  of¬ 
ten  susceptible  to  various  environmental  conditions. 
PUFs  should  be  able  to  produce  stable  outputs  even 
under  extreme  environmental  situations.  Fluctuating 
environmental  conditions  include  voltage/temperature 
variation  and  arbiter  metastability.  Since  circuit  delay 
is  heavily  dependent  on  the  voltage/temperature  vari¬ 
ations,  the  PUF  output  might  also  be  diverse  under 
voltage/temperature  variations.  Arbiter  metastability  is 
another  source  of  PUF  output  instabilities.  In  general 
arbiter-based  delay  PUFs,  if  the  delay  difference  be¬ 
tween  two  delay  lines  connected  to  one  arbiter  is  less 


i-th  full  adder  in  the  two-core  PUF 


Fig.  8.  i-th  full  adder  structure  in  the  two-core  PUF. 


than  the  setup  and  hold  time  of  the  arbiter,  the  PUF 
output  is  not  stable  and  may  fluctuate  depending  on 
environmental  conditions. 

Manufactured  PUF  instances  should  avoid  those  two 
conditions  that  definitely  degrade  quality  of  the  PUFs.  In 
this  paper,  we  introduce  a  systematic  intentional  aging 
method  to  make  the  statistical  quality  of  the  two-core  PUF 
much  better  in  terms  of  both  inter-  and  intra-chip  variations. 
Our  aging  method  complements  the  possible  drawbacks  of 
our  PUF  design. 

2)  Strategy:  Since  the  aging  process  is  a  one-way  pro¬ 
cess  and  may  degrade  circuit’s  performance,  a  careful 
intentional  aging  strategy  is  desired.  In  particular,  our 
PUF  design  leverages  in-built  structures  as  our  path  delay 
sources  (i.e.,  not  deploying  additional  dedicated  circuits). 
In  this  case,  aging  may  in  turn  degrade  the  entire  circuit 
performance.  In  pipelined  processors,  though  the  execution 
stage  where  a  processor  performs  ALU  operations  [20] 
does  not  typically  lie  in  the  critical  path  of  processors  [21], 
performance  of  a  few  chips  which  have  their  critical  path  in 
the  execution  stage  may  be  adversely  affected  by  increased 
ALU  delay  after  the  intentional  aging  process. 

Our  aging  strategy  is  to  apply  intentional  aging  only 
to  the  gates  which  do  not  lie  in  the  critical  path  of  the 
adder.  Figure  8  shows  the  structure  of  a  full  adder  which 
is  a  substructure  of  our  two-core  PUF.  The  critical  path  of 
the  entire  adder  is  a  carry  propagation  chain  (NAND2  and 
NAND3  gates),  which  implies  XOR1,  XOR2,  and  NANDI 
gates  except  those  in  the  last  full  adder  (FA)  do  not  affect 
the  critical  path  delay.  In  summary,  the  XOR1,  XOR2,  and 
NANDI  are  safe  to  apply  the  intentional  aging  while  the 
NAND2  and  NAND3  gates  might  be  very  sensitive  to  the 
circuit’s  entire  performance.  Thus,  to  minimize  side-effects 
from  the  intentional  aging,  we  selectively  apply  intentional 
aging  only  to  XOR1,  XOR2,  and  NANDI  gates  in  the  full 
adder. 

A  careful  selection  of  the  full  adders  that  must  be 
intentionally  aged  is  also  important.  In  our  PUF  design,  the 
n-th  bit  response  is  closely  related  to  delay  characteristics 
of  the  n-th  full  adder.  Our  strategy  is  to  utilize  statistical 
metrics  to  determine  which  full  adders  have  bad  statistical 
properties.  First,  we  figure  out  which  PUF  output  bits  show 
a  relatively  bad  statistical  quality  by  investigating  its  output 
behaviors  (i.e.,  the  probability  of  occurring  ‘0’  and  ‘1’  in 
each  PUF  output  bit).  And  then,  we  choose  the  full  adders 
which  correspond  to  those  PUF  output  bit.  Our  main  targets 
for  intentional  aging  algorithms  are  XOR1,  XOR2,  and 
NANDI  gates  in  the  selected  full  adders. 

3 )  Figuring  out  the  input  vectors  for  aging:  Figure  9 
shows  how  to  generate  input  vectors  for  intentional  aging. 
One  input  vector  is  an  operand  A=0xffffffff  (unsigned)  and 
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To  apply  the  intentional  aging  to  the  i-th  full  adder 


First 

Operand  A  1 1  1  | 

T]  - 

i 

1 

Oxffffffff 

input  ■ 

Operand  B  0  |  0  | 

o]  - 
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0x00000000 

vector 

Initial  carry  bit  (C0 

=i 

Second 
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0 
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The  i-th  bit:  1 

vector 

Initial  carry  bit  (C0 

=  0 

T 

The  remaining  bits:  0 

i-th  bit 


Fig.  9.  Input  vector  generation  for  our  intentional  aging  process. 


TABLE  I 

A  TRUTH  TABLE  OF  FULL  ADDERS. 


State 

A, 

Bi 

Ci-i 

XOR1 

XOR2 

NANDI 

NAND2 

NAND3 

0 

0 

0 

0 

0 

0 

1 

1 

0 

1 

0 

0 

1 

0 

1 

1 

1 

0 

2 

0 

1 

0 

1 

1 

1 

1 

0 

3 

0 

1 

1 

1 

0 

1 

0 

1 

4 

1 

0 

0 

1 

1 

1 

1 

0 

5 

1 

0 

1 

1 

0 
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0 

1 

6 

1 

1 

0 

0 

0 

0 

1 

1 

7 

1 

1 

1 

0 

1 

0 

1 

1 

B=0x00000000  (unsigned)  with  an  initial  carry  bit  Cq=1 
(see  Figure  5),  assuming  32-bit  two-core  PUF  is  used.  The 
first  input  vector  (two  operands)  is  stationary  regardless  of 
which  full  adder  (FA)  is  aged.  For  the  other  input  vector, 
operand  A  is  an  operand  of  all  ‘0’  (A=0x00000000).  The 
operand  B  of  the  second  input  vector  has  different  bit 
sequences  depending  on  which  full  adders  are  intentionally 
aged.  The  i-th  bit  of  the  operand  B  is  ‘1’  if  the  i-th  FA 
must  be  aged  and  the  rest  of  bits  are  all  ‘O’ .  For  example, 
let  us  suppose  that  the  first  and  third  FA  should  be  aged.  In 
this  case,  one  can  make  the  operand  B  of  the  second  input 
vector  as  0x00000005  (i.e.,  within  the  32-bit  operand,  only 
the  first  and  third  bit  are  ‘1’  and  all  other  bits  are  ‘0’)-  Note 
that  the  initial  carry  bit  of  the  second  input  vector  should 
be  ‘0’  (C'o=0).  For  our  aging  process,  the  first  and  second 
input  vectors  are  fed  into  the  two-core  PUF  alternately. 

Our  aging  input  vector  generation  leverages  the  stress 
and  recovery  mechanism  of  CMOS  NBTI  [22].  When  a  gate 
has  an  output  of  ’1’,  the  current  passes  through  the  PMOS 
devices,  which  means  the  gate  is  in  the  stress  period  of 
NBTI.  Otherwise,  the  gate  is  in  the  recovery  period.  Thus, 
to  age  a  gate,  one  should  enforce  the  gate  in  the  stress 
period  more  than  the  recovery  period. 

As  we  explained  in  Section  IV-A2,  our  goal  is  to  age 
only  XOR  gates  and  NANDI  gate  in  the  full  adder.  The 
first  input  vector  enforces  all  of  the  full  adders  to  reside  in 
the  state  5  in  Table  I.  In  this  cycle,  XOR1,  NANDI,  and 
NAND3  gates  are  in  the  stress  period  (i.e.,  gate  output=l) 
while  XOR2  and  NAND2  gates  are  in  the  recovery  period 
(i.e.,  gate  output=0).  In  the  next  cycle,  by  using  the  second 
input  vector,  the  full  adders  which  must  be  aged  are 
enforced  to  be  in  the  state  2  in  Table  I  while  the  other 
full  adders  are  in  the  state  0. 

Table  II  shows  the  ratio  between  the  stress  and  recovery 
period  when  our  first  and  second  aging  input  vector  are 
alternately  fed  into  the  two-core  PUF.  As  a  result,  only 
XOR1  and  NANDI  gate  are  aged  while  the  other  gates 
are  minimally  affected  because  the  stress  and  recovery 
period  occur  alternately.  Though  it  seems  that  NANDI 
gates  are  aged  far  more  than  the  other  gates,  NANDI  gates 
hardly  affect  output  delays  since  they  are  neither  directly 


TABLE  II 

Stress/recovery  period  ratio  and  duty  cycle  of  each  gate  in 
A  FULL  ADDER  (FA). 


FA  in 

which  aging  is  applied 

FA  in  which  aging  is  not  applied 

Stress  period 

Recovery  period 

Duty  cycle 

Stress  period 

Recovery  period 

Duty  cycle 

XOR1 

100% 

0% 

1 

50% 

50% 

0.5 

XOR2 

50% 

50% 

0.5 

0% 

100% 

0 

NANDI 

100% 

0% 

1 

100% 

0% 

1 

NAND2 

50% 

50% 

0.5 

50% 

50% 

0.5 

NAND3 

50% 

50% 

0.5 

50% 

50% 

0.5 

connected  to  the  paths  to  the  arbiters  nor  placed  on  the 
critical  path  of  the  adder. 

NAND2  and  NAND3  gates  may  also  be  a  little  aged 
together  due  to  the  partial  recovery  mechanism  of  NBTI. 
In  the  case  of  stress  period=50%  and  recovery  period=50% 
(i.e.,  duty  cycle=0.5),  the  gate  is  aged  2~3  times  less 
than  the  gate  with  a  stress  period  of  100%  [7].  However, 
assuming  that  one  increases  Vth  by  0.1V  via  our  aging 
process,  the  entire  impact  on  the  critical  paths  of  the  adder 
is  only  7.53%  in  the  worst  case.  Note  that  increasing  Vth  by 
0.  IV  is  fairly  sufficient  to  obtain  a  good  statistical  property 
of  our  PUF  (detailed  results  will  be  shown  in  Section 
V-C).  It  means  that  our  aging  process  hardly  affects  the 
entire  circuit  performance  because  most  processors  have 
their  critical  path  in  the  cache  access  (MEM  stage)  or 
register  file  access  pipeline  stage  (RF/ID  stage)  [21].  Note 
that  there  is  no  additional  hardware  overhead  required  only 
for  our  intentional  aging  which  can  be  performed  with  the 
specialized  input  vectors  or  programs. 

During  our  aging  process,  the  wires  as  well  as  logic 
gates  would  be  aged.  However,  the  NBTI  aging  mechanism 
mostly  affects  PMOS  devices  [7],  which  means  the  aging 
in  wires  is  negligible  compared  to  the  aging  in  the  logic 
gates. 

For  efficient  intentional  aging  in  the  post-silicon  stage, 
designers  or  manufacturers  can  perform  the  intentional 
aging  process  with  appropriately  high  temperature  envi¬ 
ronment.  We  also  note  here  that  the  high  temperature 
environment  used  in  our  intentional  aging  process  should 
not  incur  any  break  down  in  devices,  but  only  accelerate 
aging  process. 

4)  Sample  spaces  to  measure  the  statistical  properties: 
There  are  two  types  of  the  sample  spaces  which  are  used 
in  our  aging  algorithms:  inter-chip  sample  space  and  intra¬ 
chip  sample  space. 

•  inter-chip  sample  space:  it  is  composed  of  the  PUF 
responses  from  different  chip  instances  when  the  same 
challenge  input  is  fed.  Note  that  one  can  feed  a  large 
number  of  challenges  to  generate  a  representative  and 
sufficiently  large  number  of  inter-chip  sample  spaces. 
In  this  paper,  we  give  1,000,000  random  challenge  in¬ 
puts  to  the  PUF  instances  (i.e.,  we  generate  1,000,000 
inter-chip  sample  spaces ). 

•  intra-chip  sample  space:  the  bit  samples  are  the  PUF 
responses  from  the  same  chip  with  the  same  challenge 
under  different  environmental  conditions.  The  sample 
space  is  composed  of  1 1  PUF  responses  and  each  re¬ 
sponse  is  extracted  under  different  environmental  con¬ 
ditions.  There  are  three  factors  considered  to  generate 
different  environmental  conditions:  voltage  variation, 
temperature  variation,  and  arbiter  metastability.  For 
generating  1 1  different  environmental  parameters,  the 
voltage  and  temperature  are  randomly  selected  within 
the  range  of  1.0V-1.2V  and  253K-393K,  respectively. 
Arbiter  metastability  is  also  considered  as  a  source  of 
unstable  PUF  responses.  For  high  representativeness 
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Alg.  1  Algorithm  for  intentional  aging  to  N-bit  two-core  PUF  to 
increase  inter-chip  variations. 

Alg.  2  Algorithm  for  intentional  aging  to  N-bit  two-core  PUF  to 
reduce  intra-chip  variation. 

Input:  Statistical  distribution  of  the  PUF  results 
in  the  inter-chip  sample  space 

Input:  Statistical  distribution  of  the  PUF  results 
in  the  intra-chip  sample  space 

1 

for  i  t—  1  to  k  II  Main  loop 

1 

for  *  <—  1  to  k  //  Main  loop 

2 

Update  PUF  results  and  statistical  distribution; 

2 

Update  PUF  results  and  statistical  distribution; 

3 

for  j  <—  1  to  N 

3 

for  j  <—  1  to  N 

4 

Pj  <—  prob.  of  occurrence  of  ‘I’ 
in  j-th  bit  within  the  responses 
across  different  chip  instances; 

(i.e.,  in  the  inter-chip  sample  spaces); 

4 

Pj  <—  prob.  of  occurrence  of  4'  in  j-th  bit 
from  the  responses  in  the  sample  space 
across  various  environmental  parameters; 

(i.e.,  in  the  intra-chip  sample  spaces); 

5 

if (Pj  >  0.6) 

5 

if (Pj  >  0.5) 

6 

Apply  aging  ( Vth  increment  of  0.01  V) 
to  j-th  FA  in  CoreO; 

6 

Apply  aging  (Vth  increment  of  0.01  V) 
to  j-th  FA  in  Corel; 

7 

iff  P,  <  0.4) 

7 

else 

8 

Apply  aging  ( Vth  increment  of  0.01  V) 
to  j-th  FA  in  Corel; 

8 

Apply  aging  (Vth  increment  of  0.01  V) 
to  j-th  FA  in  CoreO; 

9 

endfor 

9 

endfor 

10 

endfor 

10 

endfor 

of  the  samples,  1,000,000  different  intra-chip  sample 
spaces  are  generated  with  1,000,000  different  chal¬ 
lenge  inputs  for  each  chip  sample. 

B.  Aging  algorithm  to  increase  inter-chip  variations 

To  increase  inter-chip  variations,  one  should  make  the 
probability  occurring  ‘0’  and  ‘T  as  close  as  possible  for 
each  response  bit  across  PUF  instances  to  minimize  a  bias 
in  the  responses.  For  Mil  PUF  response  bit,  i-th  full  adder 
(FA)  mainly  contributes  to  the  delay  to  i-th  arbiter.  Thus, 
in  the  case  that  one  tries  to  change  the  i-th  bit  response, 
one  can  selectively  apply  the  aging  process  to  the  i-th  full 
adder.  Depending  on  the  occurring  frequency  of  0  and  1 
for  each  response  bit  in  the  inter-chip  sample  spaces,  one 
can  determine  which  core’s  i-th  full  adder  must  be  aged. 

Alg.  1  shows  a  detailed  algorithm  to  make  inter-chip 
variations  of  the  two-core  PUFs  higher.  For  the  input  of  this 
algorithm,  the  statistical  distribution  of  the  PUF  responses 
in  the  inter-chip  sample  spaces  is  required.  To  determine 
which  core’s  full  adders  must  be  aged,  our  algorithm  in¬ 
vestigates  the  Pj  which  represents  probability  of  occurring 
‘1’  in  j-th  bit  in  the  inter-chip  sample  spaces  (Line  4  in 
Alg.  1).  In  this  paper,  when  updating  the  PUF  statistical 
distributions  with  the  generated  inter-chip  sample  spaces, 
we  use  1,000,000  random  challenge  programs  which  are 
newly  selected  for  each  iteration.  Thus,  the  bias  towards 
a  certain  set  of  challenges  (i.e.,  1,000,000  challenges  used 
for  updating  the  statistical  distributions)  is  removed.  If  Pj  is 
greater  than  0.6,  it  means  that  the  j-th  FA  in  the  coreO  tends 
to  be  faster  than  that  in  the  Corel.  Thus,  our  algorithm  apply 
the  aging  process  to  the  j-th  FAs  in  CoreO.  Our  algorithm 
increases  Vth  of  the  XOR1  gate  in  the  FA  by  0.01V  for 
each  iteration  of  the  intentional  aging  process.  The  input 
vectors  for  our  aging  process  are  alternately  fed  into  the 
PUF  until  the  Vth  of  the  XOR1  gate  is  increased  by  0.01V. 
By  using  Equation  2,  one  can  obtain  an  appropriate  stress 
time  to  increase  Vth  by  0.01V  under  a  certain  temperature. 
The  input  vector  sequence  for  selectively  applying  the  aging 
process  to  a  specific  FA  is  already  explained  in  Section 
IV- A3.  On  the  other  hand,  if  Pj  is  less  than  0.4,  our 
algorithm  applies  the  aging  process  to  j-th  FAs  in  Corel 
to  make  Corel’s  FAs  slower  than  before.  This  process  is 
iterated  k  times  and  the  number  of  iterations  ( k )  can  be 


determined  by  considering  the  degree  of  the  delay  bias. 
Note  that  this  algorithm  is  applied  globally  to  the  chips. 


C.  Aging  algorithm  to  reduce  intra-chip  variation 

To  reduce  intra-chip  variation,  one  should  make  the  delay 
difference  between  two  delay  lines  (which  are  connected  to 
the  arbiters)  larger  so  that  the  PUF  can  be  stable  under  a 
certain  degree  of  the  environmental  variabilities.  By  doing 
so,  we  try  to  make  the  probability  of  occurrence  of  unstable 
PUF  responses  as  low  as  possible. 

Alg.  2  describes  our  algorithm  to  reduce  intra-chip 
variations.  With  a  given  or  updated  statistical  distribution  of 
the  PUF  responses  in  intra-chip  sample  spaces  (as  in  Alg. 
1,  one  million  random  challenges  which  are  newly  chosen 
for  each  iteration  are  used),  our  algorithm  applies  the  aging 
process  to  make  the  PUF  responses  more  stable.  For  j=l 
to  32,  if  Pj  is  greater  than  0.5  (i.e.,  CoreO’s  j-th  FA  tends 
to  be  faster  than  Corel’s  j-th  FA),  then  our  algorithm  ages 
the  j-th  FA  in  Corel  to  make  the  delay  difference  between 
the  FAs  in  CoreO  and  Corel  larger  than  before.  Otherwise, 
our  algorithm  ages  j-th  FAs  in  CoreO.  Similar  to  Alg.  1, 
Vth  °f  the  XOR1  gate  in  the  FA  is  increased  by  0.0  IV  for 
each  iteration  and  the  main  loop  is  repeated  by  k  times. 
Unlike  Alg.  1,  Alg.  2  is  individually  applied  to  each  chip. 

Applying  the  intentional  aging  by  using  our  algorithms 
may  incur  a  huge  post-processing  cost.  In  case  that  the  huge 
post-processing  overhead  is  expected,  one  can  selectively 
apply  the  Alg.  2  to  the  PUF  chips  of  which  stability  does 
not  meet  a  quality  standard,  which  can  be  determined  by 
PUF  designer  or  manufacturer  considering  the  field  usage  of 
the  PUFs.  There  can  also  be  an  efficient  trade-off  between 
the  post-processing  cost  and  PUF  quality  standard  (or  yield 
of  the  manufactured  PUF),  though  analyzing  the  detailed 
trade-off  between  them  is  out  of  scope  of  this  paper. 


D.  Reliability  and  security  discussions  on  aging 

For  our  PUF  design,  there  are  several  security,  reliability, 
and  maintainability  issues  on  aging:  malicious  usage  of  our 
aging  algorithm  and  natural  aging  effect.  In  this  subsection, 
we  address  those  issues  and  introduce  possible  countermea¬ 
sures. 
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1 )  Malicious  usage  of  our  algorithm:  An  adversary  may 
try  to  make  the  PUF  responses  of  a  certain  chip  as  he  or 
she  desires  by  using  our  algorithms’  capability  of  changing 
the  PUF  responses.  For  example,  the  adversary  can  try 
to  use  our  aging  algorithm  in  the  opposite  way,  which 
may  make  the  statistical  property  of  our  PUF  worsened. 
One  possible  way  to  prevent  this  attack  is  to  deploy  aging 
sensors  [23][24],  which  can  detect  how  much  the  circuit 
is  aged  by  measuring  the  frequency  of  ring  oscillators  or 
delay  elements.  If  the  aging  sensor  detects  a  certain  degree 
of  the  aging  within  a  short  time  period,  the  OS  can  enforce 
the  PUF  to  reside  in  a  sleep  mode  so  that  the  ALU  can 
be  cooled  down  and  stop  executing  the  malicious  code.  As 
another  solution,  already  employed  thermal  sensors  can  also 
detect  the  execution  of  the  malicious  code  for  malicious 
aging.  This  is  because  the  malicious  code  which  tries  to 
age  the  PUF  by  an  adversary  should  intensively  access  the 
ALU,  which  makes  it  significantly  hot.  It  triggers  dynamic 
thermal  management  (DTM)  to  prevent  thermal  emergency 
in  a  microprocessor  [18]  [25],  The  DTM  also  cools  down 
the  ALU  by  engaging  the  coercive  sleep  mode  in  the 
microprocessor.  Since  the  NBTI  aging  heavily  depends  on 
the  circuit  temperature,  cooled  ALUs  are  affected  little  from 
the  malicious  aging  by  the  adversary. 

2)  Natural  aging  and  recovery  due  to  the  NBTI:  The 
natural  aging  may  affect  the  PUF  responses.  However,  there 
are  two  important  reasons  which  support  the  claim  that 
our  PUF  design  is  safe  against  the  natural  aging  effect, 
i)  Our  PUF  is  based  on  the  delay  comparison  between  two 
symmetric  delay  lines  by  using  the  arbiter.  In  real  usage 
cases,  both  paths  are  generally  aged  together,  so  that  the 
PUF  responses  are  not  likely  to  be  affected  by  natural  aging 
effects,  ii)  NBTI  mechanism  has  a  recovery  period  (i.e., 
increased  Vth  due  to  the  aging  is  recovered  to  a  certain 
extent)  when  PMOS  devices  are  not  used.  It  means  that 
the  PUF  structure  may  not  be  aged  too  much  under  the 
assumption  that  the  most  of  the  gates  have  a  duty  cycle  of 
0.5.  For  a  safeguard  mechanism,  one  can  also  deploy  the 
aging  sensors  as  introduced  in  the  previous  subsection,  in 
order  to  detect  the  natural  aging  as  well  as  malicious  aging. 

In  addition  to  the  natural  aging,  increased  Vth  due  to  our 
intentional  aging  may  be  recovered  by  the  NBTI  recovery. 
In  this  case,  the  improved  statistical  property  of  the  PUFs 
may  be  worsened  again.  However,  the  recovered  Vth  by 
the  NBTI  recovery  cannot  reach  the  original  Vth  due  to 
the  partial  recovery  of  the  NBTI  [7],  In  case  that  a  high 
degree  of  natural  NBTI  recovery  is  expected,  one  can  make 
that  the  partial  recovery  from  NBTI  hardly  affects  the 
PUF  responses  (e.g.,  further  conducting  the  aging  process 
considering  the  expected  partial  recovery).  Note  that  quan¬ 
tifying  the  impact  of  NBTI  recovery  and  investigating  a 
detailed  mechanism  for  the  aging  process  considering  the 
NBTI  recovery  are  out  of  the  scope  of  this  paper.  We  leave 
them  as  our  future  work. 

V.  Evaluation 

A.  Evaluation  setup 

Our  evaluation  results  are  based  on  an  accurate  gate-level 
delay  simulation  framework.  We  gave  a  threshold  voltage 
(Vth)  variation  to  each  chip  instance.  Though,  in  the  case 
of  Vth,  random  variation  is  known  to  be  more  remarkable 
than  systematic  variation  [1],  Vth  distributions  are  also 
spatially  correlated  [2]  [21]  [26].  Hence,  we  used  a  quad¬ 
tree  process  variation  model  [1]  to  precisely  model  both 
the  random  and  systematic  variation  of  the  Vth-  Assuming 


TABLE  III 

Average  inter-chip  Hamming  distance  results. 


Mean  1 

3xcr//i 

Vth,  x  20% 

Vth  x  30% 

VthX  40% 

32-bit  output 

12.38  (38.69%) 

12.21  (38.16%) 

12.48  (39.00%) 

64-bit  output 

22.67  (35.42%) 

21.92  (34.25%) 

22.92  (35.81%) 

Standard  deviation  1 

3  xcr//i 

Vth,  x  20% 

Vth  x  30% 

Uh  x  40% 

32-bit  output 

2.76 

2.75 

2.76 

64-bit  output 

3.83 

3.80 

3.84 

Vth  distributions  in  a  chip  follow  the  normal  distribution 
N(p,  er2),  we  considered  three  different  process  variation 
severities:  3xa/p=Vthx20%,  VthX 30%,  and  1^x40%. 
We  generated  1,000  different  chip  instances  for  our  Monte 
Carlo  simulations  for  each  process  variation  severity.  Our 
ALU  (adder)  model  is  based  on  the  Xilinx  fast  ripple-carry 
adder  model  [27].  The  placement  information  is  used  to 
map  Vth  parameters  (generated  by  [1])  to  each  gate  in 
a  chip.  We  obtained  a  nominal  gate  delay  from  HSPICE 
circuit  simulations  with  45nm  process  technology.  By  using 
Equation  1,  we  figured  out  delay  of  each  gate  according  to 
Vth  of  that  gate.  Note  that  we  exclude  the  simulation  results 
when  the  additional  XOR  obfuscation  step  is  deployed  since 
it  is  not  essential  but  optional  in  our  PUF  design. 

B.  Statistical  results  for  two-core  PUFs 

1 )  Inter-chip  variations:  In  this  subsection,  we  present 
inter-chip  variation  results  of  our  two-core  PUF  for  both 
32-bit  and  64-bit  ALUs.  To  quantify  the  inter-chip  varia¬ 
tions,  we  measure  inter-chip  Hamming  distances  between 
different  PUF  instances  when  we  feed  the  same  challenge 
program  to  both  cores.  We  show  the  average  inter-chip 
Hamming  distance  results  from  1,000,000  different  chal¬ 
lenge  inputs  (challenge  programs). 

Table  III  shows  the  inter-chip  Hamming  distances  (mean 
and  standard  deviation)  across  three  difference  process 
variation  severities.  As  Table  III  suggests,  uniqueness  of 
our  two-core  PUF  is  comparable  to  the  existing  strong  PUF 
designs  [4].  On  average,  the  inter-chip  Hamming  distance 
is  12.35  bits  (38.61%  -  ideally  50%)  within  the  32-bit  re¬ 
sponses.  Regardless  of  process  variation  severities  (3  xo/p, 
=Vthx20%,  30%,  and  40%),  the  inter-chip  variations  are 
shown  to  be  around  38%.  It  means  that  our  two-core  PUF 
can  be  a  low-overhead  alternative  for  conventional  strong 
PUFs. 

We  also  provide  the  results  for  64-bit  two-core  PUF 
design  since  many  commodity  microprocessors  are  using 
the  64-bit  datapath.  Across  three  process  variation  severity 
cases,  the  average  inter-chip  Hamming  distance  is  22.51 
bits  (35.16%). 

2)  Intra-chip  variations:  In  this  subsection,  we  present 
intra-chip  variation  results  under  various  environmental 
circumstances.  We  consider  three  cases  that  can  affect  the 
intra-chip  variations:  voltage  variation,  temperature  varia¬ 
tion,  and  arbiter  metastability. 

Table  IV  shows  intra-chip  variation  results.  To  estimate 
the  intra-chip  variations,  we  first  give  the  same  challenge  to 
the  same  chip  by  1 1  times  and  collect  the  PUF  responses. 
Each  of  11  PUF  operations  is  performed  under  random 
environmental  conditions  (the  voltage  and  temperature  are 
randomly  selected  within  the  range  of  1.0V-1.2V  and  253K- 
393K,  respectively).  Arbiter  metastability  also  generates 
some  noise  to  PUF  responses.  Intra-chip  Hamming  distance 
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TABLE  IV 

Average  intra-chip  Hamming  distance  variation  results 
UNDER  1,000,000  DIFFERENT  CHALLENGE  INPUTS. 


Mean 

3xcr//i 

Vthx20% 

Vth  x  30% 

Vth  x  40  % 

32-bit  output 

3.59  (11.22%) 

2.77  (8.66%) 

1.36  (4.25%) 

64-bit  output 

6.76  (10.56%) 

5.06  (7.91%) 

3.31  (5.17%) 

Standard  deviation 

3  xa/n 

Vth  x  20% 

Vth  x  30% 

Vthx  40% 

32-bit  output 

1.79 

1.59 

1.14 

64-bit  output 

2.46 

2.16 

1.77 

TABLE  V 

Average  inter-chip  Hamming  distance  results  before  and 

AFTER  OUR  INTENTIONAL  AGING  PROCESS. 


Normal  case 

Extreme  case 

Before  aging 

12.47  (38.96%) 

5.79  (18.10%) 

After  aging 

14.47  (45.23%) 

13.58  (42.45%) 

results  are  collected  under  1,000,000  different  challenge 
inputs  and  also  averaged  out  to  obtain  the  final  results. 

As  shown  in  Table  IV,  the  average  intra-chip  Hamming 
distances  are  2.58  bits  and  5.04  bits  (8.05%  and  7.88%  - 
ideally  0%),  in  the  case  of  32-bit  ALUs  and  64-bit  ALUs, 
respectively.  Since  our  PUF  design  is  based  on  the  delay 
comparison  mechanism  of  the  arbiter,  most  of  the  intra¬ 
chip  variations  are  due  to  the  arbiter  metastability.  One 
thing  worth  noting  is  the  intra-chip  Hamming  distances 
under  more  severe  process  variation  scenarios  tend  to  be 
lower.  This  is  because  the  delay  differences  between  two 
symmetric  delay  paths  tend  to  be  higher  under  severe 
process  variations,  which  makes  our  PUF  more  robust  under 
a  certain  level  of  the  environmental  variations. 

Though  the  intra-chip  Hamming  distances  under  less 
severe  process  variation  scenarios  may  seem  to  be  non- 
negligible,  it  can  be  alleviated  by  our  post-silicon  inten¬ 
tional  aging  algorithms.  The  post-silicon  tuning  results  for 
intra-chip  variation  reduction  will  be  presented  in  Section 
V-C2. 

C.  Statistical  results  for  post-silicon  tuning 

1)  For  inter-chip  variation  improvement:  To  figure  out 
the  effectiveness  of  our  proposed  intentional  aging  method, 
we  performed  a  Monte  Carlo  simulation  with  1,000  differ¬ 
ent  chip  instances.  The  process  variation  severity  is  3 xa/ p 
=VthX 30%.  In  this  subsection,  we  provide  two  different 
practical  cases  to  apply  our  intentional  aging  method  to 
increase  inter-chip  variations.  The  first  is  a  normal  case 
where  the  delay  of  two  cores  are  not  significantly  biased. 
The  other  is  an  extreme  case  in  which  the  delay  of  two 
cores  is  significantly  biased  [19].  For  the  extreme  case,  after 
generating  the  chip  instances  using  the  quad-tree  process 
variation  model,  we  gave  an  additional  5%  delay  bias  effect 
between  two  cores  in  each  chip  instance  so  that  the  ALU 
in  one  core  tends  to  be  faster  than  that  in  the  other  core. 

Table  V  shows  average  inter-chip  Hamming  distance 
(HD)  results  before  and  after  our  intentional  aging  process. 
Before  applying  our  intentional  aging  process,  the  baseline 
(before  aging)  inter-chip  HD  is  12.47  (12.47/32=38.96%  - 
50%  is  an  ideal  case)  and  5.79  (5.79/32=18.10%)  for  the 
normal  and  extreme  case,  respectively.  It  is  quite  natural 
that  the  inter-chip  HD  for  the  extreme  case  is  lower  than 
that  for  the  normal  case  because  it  has  a  higher  possibility 
not  to  have  unique  responses  but  to  have  biased  responses 


Fig.  10.  Average  inter-chip  Hamming  distance  results  with  regard  to  the 
number  of  iterations  ( k )  in  Alg.  1. 


Fig.  1 1 .  Average  intra-chip  Hamming  distance  results  with  regal'd  to  the 
number  of  iterations  ( k )  in  Alg.  2. 


across  the  different  chip  instances.  For  the  normal  case, 
after  applying  Alg.  1  with  k= 20,  the  inter-chip  HD  be¬ 
comes  14.47  (14.47/32=45.23%),  which  means  uniqueness 
of  the  PUF  responses  across  different  chips  is  significantly 
improved.  For  the  extreme  case,  the  average  inter-chip  HD 
becomes  up  to  13.58  (13.58/32=42.45%).  It  implies  that 
our  intentional  aging  method  makes  the  chip  instances 
practically  usable  even  in  the  case  that  a  significant  delay 
bias  exists  between  two  cores  due  to  systematic  process 
variations. 

To  see  how  one  can  determine  the  parameter  k  in 
practice,  we  also  show  a  trend  of  the  inter-chip  HD  as 
we  increase  k  in  Figure  10.  After  7  iterations  (i.e.,  Vth 
increase  by  at  most  0.07V  via  our  intentional  aging),  the 
inter-chip  HD  is  almost  saturated.  In  other  words,  only 
with  7  iterations  one  can  gain  the  maximum  obtainable 
uniqueness  for  a  certain  set  of  the  chip  samples. 

2)  For  intra-chip  variation  reduction:  Figure  11  shows 
effectiveness  of  Alg.  2.  Before  applying  Alg.  2,  the 
average  intra-chip  Hamming  distance  (HD)  is  3.77 
(3.77/32=11.78%  -  0%  is  an  ideal  case),  which  implies  there 
exist  Hamming  distances  of  3-4  bits  upon  the  repetitive 
measurements  with  the  same  challenge.  However,  after 
applying  Alg.  2  to  each  chip  with  k= 20,  the  intra-chip 
HD  is  reduced  to  0.07  (0.07/32=0.26%),  which  implies 
a  significant  the  intra-chip  HD  reduction.  With  only  3 
iterations  of  our  algorithm  (i.e.,  Vth.  increase  by  at  most 
0.03V  via  our  intentional  aging),  one  can  get  the  the  intra¬ 
chip  HD  values  below  1.0,  which  implies  that  there  is  an 
average  of  at  most  only  one -bit  Hamming  distances  within 
the  32-bit  responses  upon  the  repetitive  measurements.  In 
this  case,  one  can  deploy  a  light-weight  error  correction 
method  (e.g.,  single  error  correction  double  error  detection) 
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Fig.  12.  Average  intra-chip  Hamming  distance  of  bit  i  (i=l-32)  results 
within  32-bit  responses  before  and  after  applying  Alg.  2  with  k= 20. 


instead  of  the  high  overhead  error  correction  methods  such 
as  BCH  coding  [5]  [6].  If  the  PUF  designers  want  to  make 
the  PUF  responses  in  terms  of  intra-chip  variations  more 
robust,  it  is  possible  to  apply  a  higher  k  in  Alg.  2. 

Figure  12  shows  the  average  intra-chip  HD  per  bit  results 
within  the  32-bit  responses  before  and  after  applying  Alg. 
2  with  k= 20.  It  is  measured  by  slicing  the  32-bit  response 
into  each  bit  and  measuring  the  HD  for  each  bit  i  of  the 
PUF  responses  in  the  intra-chip  sample  spaces.  Though 
the  average  intra-chip  HD  per  bit  depends  on  the  delay 
characteristics  of  each  chip,  after  applying  Alg.  2,  the 
intra-chip  HD  of  bit  i  becomes  under  0.01  for  all  is  (1- 
32),  which  means  the  bit  responses  become  very  stable 
across  different  environmental  conditions.  On  average,  our 
algorithm  reduces  the  average  intra-chip  HD  of  bit  i  from 
0.11  to  0.002  after  20  iterations  of  the  intentional  aging 
process,  which  means  the  average  intra-chip  HD  per  bit  is 
reduced  by  98%. 

VI.  Related  Work 

A.  Physically  Unclonable  Functions  (PUFs) 

A  plausible  method  for  unique  and  unclonable  identifica¬ 
tion  of  devices  and  objects  is  based  on  the  inherent  and  hard 
to  forge  randomness  or  disorder  of  their  underlying  physical 
fabrics.  To  overcome  the  exposure  associated  with  storage 
of  digital  keys,  a  novel  class  of  secret  embedding,  storage, 
and  extraction  widely  known  as  PUFs  has  emerged.  The  se¬ 
cret  generation  and  storage  mechanisms  in  PUFs  are  based 
on  the  inherent  disorder  present  in  the  silicon  [3].  Memory- 
based  PUFs,  which  are  a  type  of  weak  PUFs  [4] [28] [29], 
are  typically  used  for  secure  key  storages.  Arbiter  PUFs 
[6],  that  are  known  to  belong  to  the  strong  PUF  family,  are 
composed  of  a  series  of  switches  (MUXes),  which  change 
delay  paths  according  to  the  input  challenge  bits.  For  better 
statistical  properties  and  to  make  the  structure  resilient  to 
modeling  attacks,  different  PUF  outputs  can  also  be  XOR- 
ed  [30].  Ring-oscillator  (RO)  PUFs  [6]  are  composed  of  a 
long  chain  of  inverters.  Glitch  PUFs  [31]  exploit  a  glitch 
propagation  variability  along  the  delay  paths.  In  [5]  and 
[32],  the  PUF  structures  combined  with  microprocessor 
architecture  are  proposed.  Apart  from  PUF  design  studies, 
there  exists  work  in  literature  on  detailed  analysis  [33]  [34], 
formal  models  [35],  and  modeling  attacks  on  PUFs  [16]. 

In  this  work,  we  proposed  a  new  strong  PUF  design, 
which  is  fundamentally  based  on  delay  comparison  between 
two  symmetric  paths  by  using  arbiters.  Our  PUF  design  is 
instruction-controlled,  and  leverages  built-in  components, 
i.e.,  arithmetic  logic  unit  (ALU)  in  a  classic  processor 
architecture  for  path  delay  sources  instead  of  deploying 
separate  delay  sources  as  presented  in  [6]. 


B.  Leveraging  aging  to  PUFs  and  circuits 

Circuit  aging  is  a  common  mechanism  by  which  perfor¬ 
mance  of  the  circuits  is  degraded  as  they  are  used.  Though  a 
large  body  of  work  for  aging  resilience  in  circuit  structures 
has  been  studied,  in  this  paper,  we  focus  on  the  case  where 
one  leverages  the  intentional  aging  of  the  PUFs  for  tuning 
the  statistical  properties  of  the  PUF  responses.  Reference 

[36]  provided  the  first  set  of  formal  properties  for  the 
statistical  distribution  of  the  PUF  responses  in  terms  of  the 
inter-chip  and  intra-chip  variation. 

A  hardware  aging-based  software  metering  technique 

[37]  precisely  tracks  the  software  usage  by  feeding  the 
test  vectors  to  the  specific  circuit.  Device-aging  based  PUF 
design  [38]  leverages  aging  mechanism  to  shape  the  PUF 
responses.  It  can  also  be  used  for  a  graduation  of  the  PUF 
responses  which  is  robust  to  PUF  modeling  attacks  or  for 
better  statistical  properties  of  the  PUF  by  changing  the  PUF 
responses.  Public  PUFs  (PPUFs)  [39][40][41]  leverage  the 
aging  to  shape  the  PUF  responses.  The  main  purpose  of 
applying  aging  to  PPUFs  is  to  make  the  responses  of  the 
PUFs,  which  are  shared  among  the  trusted  parties,  iden¬ 
tical  for  low-power  consumption  and  fast  authentication. 
Leveraging  intentional  aging  for  generating  stable  outputs 
in  SRAM  (static  random  access  memory)  PUFs  was  also 
proposed  [42].  Negative  bias  temperature  instability  (NBTI) 
aging  mechanism  enables  a  more  stable  output  generation 
from  SRAM  PUFs. 

To  the  best  of  our  knowledge,  our  work  is  the  first  to 
introduce  systematic  aging  of  a  strong  PUF  (two-core  PUF) 
to  get  a  better  statistical  distribution  of  PUF  responses 
(i.e.,  signatures)  both  in  terms  of  inter-chip  and  intra-chip 
variations. 

VII.  Conclusion 

In  this  paper,  we  proposed  a  two-core  strong  PUF  archi¬ 
tecture.  Our  design  is  low  overhead  and  robust  to  systematic 
variations  because  of  its  inherently  symmetric  construction. 
To  improve  the  statistical  distribution  of  the  PUF  outputs, 
we  devised  a  novel  intentional  aging  algorithm  which 
makes  the  PUF  instances  much  more  secure  and  stable  in 
terms  of  both  inter-  and  intra-chip  variations.  Our  evaluation 
results  suggest  that  our  proposed  algorithms  greatly  im¬ 
prove  the  quality  of  the  PUF  challenge-response  statistical 
properties.  By  applying  the  algorithm  to  increase  inter-chip 
variations,  one  can  obtain  the  PUF  responses  which  have 
higher  uniqueness  across  different  chip  instances.  Also,  the 
algorithm  to  reduce  intra-chip  variations  make  our  PUF 
much  more  robust  to  the  environmental  fluctuation,  which 
also  enables  a  deployment  of  low  overhead  error  correction 
schemes  for  robustness  and  stability  of  our  PUFs. 
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